"HeartRisk Explorer: Predictive Analytics for Cardiovascular Health"¶

Table of Contents¶

1. Introduction¶

  • Project overview: Analyzing factors influencing heart attacks.

2. Importing the Libraries¶

  • Loading necessary Python libraries.

3. Reading the Files¶

  • Importing and exploring the dataset.

4. Visualizations¶

4.1 Resting Blood Pressure

   - Patterns and distribution analysis.

4.2 Cholesterol Levels

   - Impact on heart attacks.

4.3 Maximum Heart Rate

   - Relationship exploration.

4.4 Fasting Blood Sugar

   - Influence on heart attack likelihood.

4.5 Resting Electrocardiographic Results

   - Correlation with heart attacks.

4.6 Exercise-Induced Angina

   - Connection to heart attacks.

4.7 Person Getting Heart Attack

   - Distribution analysis.

4.8 Thallium Stress Test Results

   - Examination of Thall values.

4.9 Number of Blood Vessels

   - Influence on heart attack risk.

4.10 Slope of the Peak Exercise ST Segment (SLP)

   - Potential correlation with heart attacks.

4.11 Two Variable Analysis

   - Relationships between two factors.

4.12 Cholesterol vs. Heart Attack

   - Specific relationship exploration.

4.13 Age vs. Heart Attack Analysis

   - Impact of age on heart attack likelihood.

5. Conclusions¶

  • Summary of key findings.

6. Future Directions¶

  • Recommendations for future research.

Introduction:¶

The Heart Attack Dataset offers valuable insights into factors associated with the likelihood of experiencing a heart attack. With a focus on key health indicators, this dataset provides detailed information on various attributes related to cardiac health. Each entry captures a snapshot of an individual's health profile, contributing to a comprehensive understanding of potential risk factors for heart attacks.

Key Features:

  1. Age: The age of the patient, reflecting the demographic distribution within the dataset.

  2. Gender (Sex): A binary indicator denoting the gender of the patient, with 1 representing male and 0 representing female.

  3. Chest Pain Type (cp): Differentiating between types of chest pain experienced, including typical angina, atypical angina, non-anginal pain, and asymptomatic cases.

  4. Resting Blood Pressure (trtbps): The resting blood pressure of the individual, measured in millimeters of mercury (mm Hg).

  5. Cholesterol Level (chol): The cholesterol level in milligrams per deciliter (mg/dl), providing insight into lipid profiles.

  6. Fasting Blood Sugar (fbs): An indicator of fasting blood sugar levels exceeding 120 mg/dl (1 for true, 0 for false).

  7. Resting Electrocardiographic Results (restecg): Classifying resting electrocardiographic results as normal, ST-T wave abnormality, or left ventricular hypertrophy.

  8. Maximum Heart Rate Achieved (thalachh): The maximum heart rate achieved during exercise testing, a crucial cardiovascular parameter.

  9. Exercise-Induced Angina (exng): A binary indicator for the presence of exercise-induced angina (1 for yes, 0 for no).

  10. Depression of the ST Segment (oldpeak): The depression of the ST segment induced by exercise relative to rest, offering insights into cardiac stress during physical activity.

  11. Slope of the Peak Exercise ST Segment (slp): Describing the slope of the peak exercise ST segment.

  12. Number of Major Vessels (caa): The number of major vessels colored by fluoroscopy, a crucial factor in cardiovascular health. The documented range is 0 to 3, but this copy of the data also contains the value 4.

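  13. Thallium Stress Test Result (thall): A categorical code (0 to 3 in this data) for the result of the thallium stress test. The column is often described as a thalassemia type, and the renamed column used later in this notebook keeps that name.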

Target Variable:

  • Likelihood of a Heart Attack (output): The binary target variable indicating the likelihood of a heart attack—0 suggesting less chance and 1 indicating a higher likelihood based on the collective features.

By exploring this dataset, we aim to uncover patterns, correlations, and predictive relationships that can contribute to a better understanding of cardiovascular health and facilitate early detection of potential heart attack risks.

In [1]:
#Importing the Libraries
import numpy as np 
import pandas as pd 
import scipy as sp
import re
import time
import matplotlib.pyplot as plt
import seaborn as sns
import os
import plotly.express as px
import matplotlib as mpl
import warnings
warnings.filterwarnings('ignore')
In [2]:
#Reading the Files
df = pd.read_csv("C:/Users/MANOJ S/Downloads/archive (14)/heart.csv")
df
Out[2]:
age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall output
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
298 57 0 0 140 241 0 1 123 1 0.2 1 0 3 0
299 45 1 3 110 264 0 1 132 0 1.2 1 0 3 0
300 68 1 0 144 193 1 1 141 0 3.4 1 2 3 0
301 57 1 0 130 131 0 1 115 1 1.2 1 1 3 0
302 57 0 1 130 236 0 0 174 0 0.0 1 1 2 0

303 rows × 14 columns

Finding the shape of the data¶

In [3]:
df.shape
Out[3]:
(303, 14)

Inference¶

Data contains 303 rows and 14 columns

Viewing the first 5 rows¶

In [4]:
df.head()
Out[4]:
age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall output
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1
In [5]:
# info is a method: without the parentheses the cell only echoes the bound method and the full frame
df.info()
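
The difference between referencing `info` and calling it can be seen on a small frame. This is a minimal sketch that rebuilds four columns of the first five rows shown in Out[4]; the `sample`, `buffer`, and `summary` names are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# Four columns of the first five rows, copied from the Out[4] table above.
sample = pd.DataFrame(
    {
        "age": [63, 37, 41, 56, 57],
        "sex": [1, 1, 0, 1, 0],
        "chol": [233, 250, 204, 236, 354],
        "oldpeak": [2.3, 3.5, 1.4, 0.8, 0.6],
    }
)

# `sample.info` (no parentheses) is only a reference to the method;
# `sample.info()` runs it. Passing a buffer captures the printed text.
buffer = io.StringIO()
sample.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The captured text lists each column with its non-null count and dtype, which is usually what this step of an exploration is after.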
In [6]:
df.describe()
Out[6]:
age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall output
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.366337 0.683168 0.966997 131.623762 246.264026 0.148515 0.528053 149.646865 0.326733 1.039604 1.399340 0.729373 2.313531 0.544554
std 9.082101 0.466011 1.032052 17.538143 51.830751 0.356198 0.525860 22.905161 0.469794 1.161075 0.616226 1.022606 0.612277 0.498835
min 29.000000 0.000000 0.000000 94.000000 126.000000 0.000000 0.000000 71.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 47.500000 0.000000 0.000000 120.000000 211.000000 0.000000 0.000000 133.500000 0.000000 0.000000 1.000000 0.000000 2.000000 0.000000
50% 55.000000 1.000000 1.000000 130.000000 240.000000 0.000000 1.000000 153.000000 0.000000 0.800000 1.000000 0.000000 2.000000 1.000000
75% 61.000000 1.000000 2.000000 140.000000 274.500000 0.000000 1.000000 166.000000 1.000000 1.600000 2.000000 1.000000 3.000000 1.000000
max 77.000000 1.000000 3.000000 200.000000 564.000000 1.000000 2.000000 202.000000 1.000000 6.200000 2.000000 4.000000 3.000000 1.000000
In [7]:
df.isnull().values.any()
Out[7]:
False

The dataset contains no null values.
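
When a frame does contain missing values, a per-column count is usually more informative than a single True/False. The sketch below uses a hypothetical toy frame with one missing entry (the real dataset has none); `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only.
toy = pd.DataFrame({"age": [63, 37, 41], "chol": [233.0, np.nan, 204.0]})

any_missing = toy.isnull().values.any()   # one True/False for the whole frame
missing_per_column = toy.isnull().sum()   # number of missing values per column

print(any_missing)
print(missing_per_column)
```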

Showing the number of unique values per column¶

In [8]:
df.nunique()
Out[8]:
age          41
sex           2
cp            4
trtbps       49
chol        152
fbs           2
restecg       3
thalachh     91
exng          2
oldpeak      40
slp           3
caa           5
thall         4
output        2
dtype: int64
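
These counts can also be used to separate category-like columns from continuous ones. The sketch below copies the counts from the output above and applies a cut-off of five levels; that cut-off (`MAX_CATEGORY_LEVELS`) is an assumption chosen for this dataset, not something the original notebook defines.

```python
import pandas as pd

# Unique-value counts copied from the df.nunique() output above.
n_unique = pd.Series(
    {"age": 41, "sex": 2, "cp": 4, "trtbps": 49, "chol": 152, "fbs": 2,
     "restecg": 3, "thalachh": 91, "exng": 2, "oldpeak": 40, "slp": 3,
     "caa": 5, "thall": 4, "output": 2}
)

MAX_CATEGORY_LEVELS = 5  # assumed cut-off, not part of the original notebook

categorical_like = n_unique[n_unique <= MAX_CATEGORY_LEVELS].index.tolist()
numerical_like = n_unique[n_unique > MAX_CATEGORY_LEVELS].index.tolist()

print("categorical-like:", categorical_like)
print("numerical-like:", numerical_like)
```

With this cut-off the continuous group is age, trtbps, chol, thalachh, and oldpeak.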

Visualisations¶

In [9]:
plt.figure(figsize=(15,8))
plt.title('Age VS Count',size=20)
sns.histplot(df['age'],bins=48)
plt.xticks(list(range(29,79,1)))
plt.yticks(list(range(0,20,1)))
plt.show()

Inference¶

The histogram shows that most patients in the dataset are between roughly 40 and 65 years old (the interquartile range is 47.5 to 61). Patients in their 30s and 70s are comparatively few. The youngest patient is 29 and the oldest is 77, so no one in the dataset is in their 80s.

In [10]:
sex=df['sex'].value_counts()
plt.title('Finding the sex',size=20)
sns.barplot(x=sex.index,y=sex.values)
print(df.sex.value_counts())
1    207
0     96
Name: sex, dtype: int64
In [11]:
# Plotly draws its own figure, so no matplotlib figure is needed.
# Name the columns explicitly so the code does not depend on the pandas version.
sex = df['sex'].value_counts().rename_axis('sex').reset_index(name='count')
custom_colors = ['#3498db', '#e74c3c']
fig = px.pie(sex, names='sex', values='count', color_discrete_sequence=custom_colors)
fig.update_layout(title_text='Distribution of Gender in the Dataset', width=600, height=400)
fig.show()

Data shows 207 males and 96 females¶
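
The column names produced by `value_counts().reset_index()` differ between pandas 1.x and 2.x, so naming them explicitly keeps plotting code portable. The sketch below rebuilds the `sex` column from the counts printed above; the `counts` frame and its `label` column are illustrative names.

```python
import pandas as pd

# Rebuild the sex column from the counts printed above: 207 ones and 96 zeros.
sex = pd.Series([1] * 207 + [0] * 96, name="sex")

counts = (
    sex.value_counts()
    .rename_axis("sex")           # name the index explicitly
    .reset_index(name="count")    # name the counts column explicitly
)
# 1 = male, 0 = female, as stated in the feature list.
counts["label"] = counts["sex"].map({0: "female", 1: "male"})
print(counts)
```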

In [12]:
cp=df['cp'].value_counts()
plt.title('Finding the type of chest pain',size=20)
sns.barplot(x=cp.index,y=cp.values)
print(df.cp.value_counts())
0    143
2     87
1     50
3     23
Name: cp, dtype: int64

Value 0 (typical angina) = 143

Value 1 (atypical angina) = 50

Value 2 (non-anginal pain) = 87

Value 3 (asymptomatic) = 23
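
Attaching the labels to the counts in code avoids matching them up by hand. This is a small sketch using the counts printed above; the code-to-label mapping is the one listed in the notebook's own commented-out `ChestPainType` dictionary and should be treated as an assumption about the encoding.

```python
import pandas as pd

# Counts copied from the df.cp.value_counts() output above.
cp_counts = pd.Series({0: 143, 2: 87, 1: 50, 3: 23}, name="count")

# Assumed encoding, taken from the notebook's commented-out ChestPainType dict.
cp_labels = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}

labelled = cp_counts.rename(index=cp_labels).sort_values(ascending=False)
print(labelled)
```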

Resting blood pressure (in mm Hg)¶

In [13]:
# displot is figure-level and creates its own figure, so size it with height/aspect
sns.displot(df['trtbps'],height=6,aspect=1.5)
plt.xticks(list(range(90,200,10)))
plt.show()
print(df.trtbps.value_counts())
120    37
130    36
140    32
110    19
150    17
138    13
128    12
160    11
125    11
112     9
132     8
118     7
124     6
135     6
108     6
152     5
134     5
145     5
122     4
170     4
100     4
105     3
126     3
115     3
180     3
136     3
142     3
102     2
148     2
178     2
94      2
144     2
146     2
200     1
114     1
154     1
123     1
192     1
174     1
165     1
104     1
117     1
101     1
156     1
106     1
155     1
129     1
172     1
164     1
Name: trtbps, dtype: int64

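Resting blood pressure ranges from 94 to 200 mm Hg. The counts above are sorted by frequency, not by value; 120, 130, and 140 mm Hg are the most common readings.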

In [14]:
plt.figure(figsize=(15,8))
plt.title('Trtbps VS Count',size=20)
sns.histplot(df['trtbps'],bins=115)
plt.xticks(list(range(90,205,5)))
#plt.yticks(list(range(0,20,1)))
plt.show()

Cholesterol in mg/dl fetched via BMI sensor¶

In [15]:
plt.figure(figsize=(15,8))
plt.title('Cholesterol VS Count',size=20)
sns.histplot(df['chol'])
#plt.xticks(list(range(29,79,1)))
#plt.yticks(list(range(0,20,1)))
plt.show()

Maximum heart rate achieved¶

In [16]:
plt.figure(figsize=(15,8))
plt.title('Thalachh VS Count',size=20)
sns.histplot(df['thalachh'])
plt.xticks(list(range(70,210,5)))
plt.yticks(list(range(0,60,2)))
plt.show()

Shows data ranging from 71 to 202

Fasting blood sugar (> 120 mg/dl) (1 = true; 0 = false)¶

In [17]:
fbs=df['fbs'].value_counts()
plt.title('Finding the FBS',size=20)
sns.barplot(x=fbs.index,y=fbs.values)
print(df.fbs.value_counts())
0    258
1     45
Name: fbs, dtype: int64

There are 258 false values and 45 true values

In [18]:
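# Plotly draws its own figure, so no matplotlib figure is needed.
# Name the columns explicitly so the code does not depend on the pandas version.
fbs=df['fbs'].value_counts().rename_axis('fbs').reset_index(name='count')
px.pie(fbs,names='fbs',values='count')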

Data of Resting Electrocardiographic Results¶

In [19]:
restecg=df['restecg'].value_counts()
plt.title('Resting Electrocardiographic Results',size=20)
sns.barplot(x=restecg.index,y=restecg.values)
plt.show()
print(df.restecg.value_counts())
1    152
0    147
2      4
Name: restecg, dtype: int64

147 (value 0): normal

152 (value 1): having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

4 (value 2): showing probable or definite left ventricular hypertrophy by Estes' criteria

Exercise Induced angina (1 = yes; 0 = no)¶

In [20]:
exng=df['exng'].value_counts()
plt.title('Exercise induced angina (1 = yes; 0 = no)',size=20)
sns.barplot(x=exng.index,y=exng.values)
plt.show()
print(df.exng.value_counts())
0    204
1     99
Name: exng, dtype: int64

Person getting heart attack¶

In [21]:
output=df['output'].value_counts()
plt.title(' 0= less chance of heart attack 1= more chance of heart attack ',size=20)
sns.barplot(x=output.index,y=output.values)
plt.show()
print(df.output.value_counts())
1    165
0    138
Name: output, dtype: int64

Data shows 165 people have more chance of heart attack and 138 people have less chance
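
Expressed as shares, the two classes are close to balanced. The sketch below works from the counts printed above; `output_counts` and `shares` are illustrative names.

```python
import pandas as pd

# Class counts copied from the df.output.value_counts() output above.
output_counts = pd.Series({1: 165, 0: 138})

shares = output_counts / output_counts.sum()
print(shares.round(3))
```

The share for class 1 matches the mean of the `output` column in the `df.describe()` table (about 0.545), since the mean of a 0/1 column is the fraction of ones.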

Analysing the values of thall¶

In [22]:
thall=df['thall'].value_counts()
plt.title('Thall vs Counts')
sns.barplot(x=thall.index,y=thall.values)
plt.xlabel('Thall',size=20)
plt.ylabel('Counts',size=20)
print(df.thall.value_counts())
2    166
3    117
1     18
0      2
Name: thall, dtype: int64

The thall column takes the value 2 for 166 patients, 3 for 117, 1 for 18, and 0 for 2.

CAA shows the number of major blood vessels¶

In [23]:
caa=df['caa'].value_counts()
plt.title('CAA vs Counts')
sns.barplot(x=caa.index,y=caa.values)
plt.xlabel('CAA',size=20)
plt.ylabel('Counts',size=20)
print(df.caa.value_counts())
0    175
1     65
2     38
3     20
4      5
Name: caa, dtype: int64

SLP shows the slope of the peak exercise ST segment¶

In [24]:
slp=df['slp'].value_counts()
plt.title('SLP vs Counts')
sns.barplot(x=slp.index,y=slp.values)
plt.xlabel('SLP',size=20)
plt.ylabel('Counts',size=20)
print(df.slp.value_counts())
2    142
1    140
0     21
Name: slp, dtype: int64

Analysis using two variables¶

In [25]:
df[['age','output']].value_counts().sort_values()
Out[25]:
age  output
77   0          1
74   1          1
76   1          1
70   1          1
69   0          1
               ..
59   0          9
52   1          9
57   0         10
54   1         10
58   0         12
Length: 75, dtype: int64
In [26]:
A=pd.crosstab(df['age'],df['output']).reset_index()
A.columns=['age','lowrisk','highrisk']
A.head(10)
Out[26]:
age lowrisk highrisk
0 29 0 1
1 34 0 2
2 35 2 2
3 37 0 2
4 38 1 2
5 39 1 3
6 40 2 1
7 41 1 9
8 42 1 7
9 43 3 5

The table shows, for each age, the number of low-risk and high-risk patients.

In [27]:
px.line(A,x='age',y='highrisk',range_x=[25,80],title='Age with high chance of heart attack')

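The high-risk counts are largest in the early 50s (10 patients at age 54 and 9 at age 52), although age 41 also reaches 9. These are raw counts, so they partly reflect how many patients of each age are in the dataset.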
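
Because the number of patients differs from age to age, a per-age share can complement the raw counts. This is a minimal sketch on the ten crosstab rows shown in Out[26]; `A_head`, `total`, and `highrisk_share` are illustrative names.

```python
import pandas as pd

# First ten rows of the age crosstab, copied from Out[26] above.
A_head = pd.DataFrame(
    {
        "age": [29, 34, 35, 37, 38, 39, 40, 41, 42, 43],
        "lowrisk": [0, 0, 2, 0, 1, 1, 2, 1, 1, 3],
        "highrisk": [1, 2, 2, 2, 2, 3, 1, 9, 7, 5],
    }
)

A_head["total"] = A_head["lowrisk"] + A_head["highrisk"]
A_head["highrisk_share"] = A_head["highrisk"] / A_head["total"]
print(A_head[["age", "total", "highrisk_share"]])
```

On the full frame, `pd.crosstab(df['age'], df['output'], normalize='index')` would give the same shares directly. Ages with only one or two patients produce shares of 0 or 1, so the `total` column is worth keeping alongside.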

In [28]:
px.line(A,x='age',y='lowrisk',range_x=[25,80],title='Age with low chance of heart attack')

The low-risk count peaks at age 58 (12 patients), and the largest low-risk counts fall in the late 50s (10 at age 57 and 9 at age 59).

In [29]:
# lmplot is figure-level and creates its own figure; size it with height/aspect
sns.lmplot(x='age',y='trtbps',hue='output',data=df,height=6,aspect=1.3)
plt.title('Relation between age and resting blood pressure',size=20)
plt.show()

Resting blood pressure rises with age in both groups, but the fitted line for the low-risk group (output = 0) rises more steeply than the line for the high-risk group (output = 1).

Relation between heart attack risk and cholesterol level¶

In [30]:
# lmplot is figure-level and creates its own figure; size it with height/aspect
sns.lmplot(x='age',y='chol',hue='output',data=df,height=6,aspect=1.3)
plt.title('Age vs cholesterol with respect to heart attack',size=20)
plt.show()

Around age 40 there is less chance of a heart attack, but as age increases the chance of a heart attack rises relative to cholesterol level.

In [31]:
# lmplot is figure-level and creates its own figure; size it with height/aspect
sns.lmplot(x='age',y='thalachh',hue='output',data=df,height=6,aspect=1.3)
plt.title('Age vs Maximum heart rate achieved with respect to heart attack',size=20)
plt.show()

Maximum heart rate decreases as age increases. Lower maximum heart rates go together with a lower chance of heart attack (output = 0), and higher maximum heart rates with a higher chance (output = 1).

In [32]:
sns.kdeplot(x='chol',y='output',data=df,fill=True)  # fill replaces the deprecated shade argument
Out[32]:
<AxesSubplot:xlabel='chol', ylabel='output'>
In [33]:
sns.scatterplot(x='chol',y='output',data=df)
Out[33]:
<AxesSubplot:xlabel='chol', ylabel='output'>
In [34]:
sns.histplot(x='chol',y='output',data=df)
Out[34]:
<AxesSubplot:xlabel='chol', ylabel='output'>
In [35]:
sns.swarmplot(y='chol',x='output',data=df)
Out[35]:
<AxesSubplot:xlabel='output', ylabel='chol'>

Age vs Heart Attack Analysis¶

In [36]:
plt.style.use('classic')
sns.histplot(x='age',data=df,hue='output',palette='pastel')
plt.title('Age vs heart attack analysis',size=20)
plt.grid()
plt.show()

This shows that the high-risk class (output = 1) outnumbers the low-risk class most clearly between ages 40 and 50, and that the share of high-risk patients falls as age increases.

Analysing the Numerical Variables¶

In [60]:
column_renaming = {
    'age': 'Age',
    'sex': 'Gender',
    'cp': 'ChestPainType',
    'trtbps': 'RestingBloodPressure',
    'chol': 'Cholesterol',
    'fbs': 'FastingBloodSugar',
    'restecg': 'RestEcg',
    'thalachh': 'MaxHeartRate',
    'exng': 'ExerciseInducedAngina',
    'oldpeak': 'OldPeak',
    'slp': 'StSlope',
    'caa': 'VesselsCount',
    'thall': 'Thalassemia',
    'output': 'HeartAttackOutput'
}

data = df.rename(column_renaming,axis=1)
In [61]:
columns = data.columns.copy()
columns
Out[61]:
Index(['Age', 'Gender', 'ChestPainType', 'RestingBloodPressure', 'Cholesterol',
       'FastingBloodSugar', 'RestEcg', 'MaxHeartRate', 'ExerciseInducedAngina',
       'OldPeak', 'StSlope', 'VesselsCount', 'Thalassemia',
       'HeartAttackOutput'],
      dtype='object')
In [39]:
data.head()
Out[39]:
Age Gender ChestPainType RestingBloodPressure Cholesterol FastingBloodSugar RestEcg MaxHeartRate ExerciseInducedAngina OldPeak StSlope VesselsCount Thalassemia HeartAttackOutput
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1

Initial Inference: These features can play a significant role in the possibility of a heart attack.

Age: Age is a significant factor, as the risk of a heart attack tends to increase with age.

Gender: Gender differences may impact the likelihood of heart attacks, with men generally having a higher risk.

Chest Pain Type (cp): The type of chest pain can provide insights into cardiac conditions.

Resting Blood Pressure (trtbps): Elevated blood pressure is a well-known risk factor for cardiovascular issues.

Cholesterol Level (chol): High cholesterol levels are associated with an increased risk of heart disease.

In [40]:
numerical_columns = ['Age','RestingBloodPressure','Cholesterol','MaxHeartRate','OldPeak']
In [41]:
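def dist_box(data,feature=None,size=(20,5)):
    """Plot a histogram of `feature` with a horizontal box plot stacked above it."""
    fig = plt.figure(figsize=size)
    a = fig.add_axes([0.0,0.0,0.5,.5])    # lower axes: histogram
    b = fig.add_axes([0.0,0.52,0.5,.2])   # upper axes: box plot

    # Pass the values as x= so the box is drawn horizontally
    sns.boxplot(x=data[feature],color='b',ax=b)
    b.set_xticks([])
    b.set_yticks([])
    b.set_xlabel('')
    b.set_title(feature)
    sns.histplot(data[feature],bins=30,color='b',ax=a)
    a.set_xlabel('')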
In [42]:
data[numerical_columns].describe()
Out[42]:
Age RestingBloodPressure Cholesterol MaxHeartRate OldPeak
count 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.366337 131.623762 246.264026 149.646865 1.039604
std 9.082101 17.538143 51.830751 22.905161 1.161075
min 29.000000 94.000000 126.000000 71.000000 0.000000
25% 47.500000 120.000000 211.000000 133.500000 0.000000
50% 55.000000 130.000000 240.000000 153.000000 0.800000
75% 61.000000 140.000000 274.500000 166.000000 1.600000
max 77.000000 200.000000 564.000000 202.000000 6.200000
In [43]:
for features in numerical_columns:
    dist_box(data,features)
In [63]:
# Replacing the outliers with median
def replace_outliers_with_median(series, multiplier=1.5):
    """Return a copy of `series` with values outside the IQR fences set to the median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    deviation = multiplier * (q3 - q1)
    is_outlier = (series < q1 - deviation) | (series > q3 + deviation)
    return series.mask(is_outlier, series.median())

# Columns in your DataFrame
columns_to_replace_outliers = ['Age', 'RestingBloodPressure', 'Cholesterol', 'MaxHeartRate', 'OldPeak']

# Replace outliers in each selected column; assign the result back instead of
# relying on in-place modification of a column selection
for column in columns_to_replace_outliers:
    data[column] = replace_outliers_with_median(data[column])
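
As a worked example of the rule, the limits can be computed by hand from the quartiles in the Out[42] table. This is only a sketch of the arithmetic; `iqr_fences` and `quartiles` are hypothetical names introduced here.

```python
# 25% and 75% quartiles copied from the Out[42] describe() table above.
quartiles = {
    "RestingBloodPressure": (120.0, 140.0),
    "Cholesterol": (211.0, 274.5),
    "MaxHeartRate": (133.5, 166.0),
}


def iqr_fences(q1, q3, multiplier=1.5):
    """Return the (lower, upper) limits outside which a value counts as an outlier."""
    deviation = multiplier * (q3 - q1)
    return q1 - deviation, q3 + deviation


fences = {name: iqr_fences(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, (lower, upper) in fences.items():
    print(f"{name}: keep values in [{lower}, {upper}]")
```

For cholesterol the limits are 115.75 and 369.75, so the raw maximum of 564 is treated as an outlier and replaced by the median of 240.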
In [66]:
for features in numerical_columns:
    dist_box(data,features)
In [68]:
print("\033[1m"+"Feature\t\t\tSkew"+"\033[0m",'')
print(data[numerical_columns].skew())

print("\033[1m"+"="*32+"\033[0m",'')

print("\033[1m"+"Feature\t\t\tKurtosis"+"\033[0m",'')
print(data[numerical_columns].kurtosis())
Feature			Skew 
Age                    -0.202463
RestingBloodPressure    0.243470
Cholesterol             0.184743
MaxHeartRate           -0.438375
OldPeak                 0.973920
dtype: float64
================================ 
Feature			Kurtosis 
Age                    -0.542167
RestingBloodPressure   -0.145283
Cholesterol            -0.294654
MaxHeartRate           -0.348190
OldPeak                 0.104297
dtype: float64

Summary of Numerical Column Observations:

Age:

The age distribution has a slight negative skew (-0.20): the longer tail is on the younger side, and the bulk of patients sits slightly above the mean (median 55 versus mean 54.4). The negative kurtosis (-0.54) indicates a somewhat flatter distribution with lighter tails than a normal distribution.

Resting Blood Pressure:

After outlier replacement, resting blood pressure has a slight positive skew (0.24): values cluster around 130 mm Hg with a modest tail toward higher readings. The kurtosis is close to zero (-0.15), so the peak and tails are roughly comparable to those of a normal distribution.

Cholesterol:

After outlier replacement, cholesterol shows only a slight positive skew (0.18), which means a short tail toward higher values rather than a majority of individuals with high cholesterol. The negative kurtosis (-0.29) points to lighter tails than a normal distribution; the extreme values in the raw data (up to 564 mg/dl) were replaced by the median in the previous step.

Max Heart Rate:

Maximum heart rate has a moderate negative skew (-0.44): the tail extends toward lower heart rates, while most values lie above the mean (median 153 versus a mean of about 150). The kurtosis is mildly negative (-0.35), so there is no sign of heavy tails.

OldPeak:

OldPeak is the most skewed feature (skewness 0.97), so its distribution is right-skewed rather than uniform. A large share of individuals have an OldPeak value of 0 (the 25th percentile is 0), meaning no exercise-induced ST depression relative to rest. In the raw data the values range from 0 to 6.2, with a mean of 1.04 and a standard deviation of 1.16; after outlier replacement the maximum is 4.0.
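
The sign convention for skewness is easy to mix up, so a tiny check may help. The sketch below uses a hypothetical right-tailed sample and its mirror image; the numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical right-tailed sample: many zeros and a few large values.
right_tailed = pd.Series([0.0, 0.0, 0.0, 0.0, 0.8, 1.6, 6.2])
# Mirror image of the same sample, so the long tail points left instead.
left_tailed = -right_tailed

for name, sample in [("right-tailed", right_tailed), ("left-tailed", left_tailed)]:
    print(
        name,
        "skew:", round(sample.skew(), 3),
        "mean:", round(sample.mean(), 3),
        "median:", sample.median(),
    )
```

A positive skew goes with a tail toward larger values and a mean above the median; a negative skew is the reverse.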

Analysing the Categorical Variables¶

In [69]:
categorical_features = [i for i in columns if i not in numerical_columns]
print("\033[1m"+f"{', '.join(categorical_features)}"+"\033[0m",'')
Gender, ChestPainType, FastingBloodSugar, RestEcg, ExerciseInducedAngina, StSlope, VesselsCount, Thalassemia, HeartAttackOutput 
In [70]:
target='HeartAttackOutput'
Gender = {0: 'female', 1: 'male'}


# ChestPainType = {0: 'typical angina', 1: 'atypical angina', 2: 'non-anginal pain', 3: 'asymptomatic'}

FastingBloodSugar = {0: '120 mg/dl or lower', 1: 'greater than 120 mg/dl'}
# RestEcg = {0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'}
ExerciseInducedAngina = {0: 'no', 1: 'yes'}

# StSlope: slope of the peak exercise ST segment (upsloping, flat, or downsloping); the code-to-label mapping is not confirmed here


VesselsCount = {0: 'None', 1: 'One', 2: 'Two', 3: 'Three',4:'Four'}


# Thalassemia = {0: 'normal', 1: 'fixed defect', 2: 'reversible defect'}


HeartAttackOutput  = {0: 'no heart attack', 1: 'heart attack'}
In [71]:
def cat_plot(data, labels=None, hue=None, ax=None):
    data_counts = data.value_counts(ascending=True)
    if labels is None:
        labels = data_counts.index
    
    bar = sns.barplot(x=labels, y=data_counts.values, hue=hue, ax=ax)
    for i in bar.containers:
        bar.bar_label(i, label_type='edge', fontsize=8)
In [72]:
fig, axes = plt.subplots(3, 3, figsize=(15, 15))

categorical_features = ['Gender', 'ChestPainType', 'FastingBloodSugar', 'RestEcg', 'ExerciseInducedAngina', 'StSlope', 'VesselsCount', 'Thalassemia', 'HeartAttackOutput']
feature_labels = ['Gender', 'ChestPainType', 'FastingBloodSugar', 'RestEcg', 'ExerciseInducedAngina', 'StSlope', 'VesselsCount', 'Thalassemia', 'HeartAttackOutput']

for i, (ax, feature) in enumerate(zip(axes.flat, categorical_features)):
    cat_plot(data[feature], labels=None, hue=None, ax=ax)
    ax.set_title(feature_labels[i])

plt.tight_layout()
plt.show()

Gender:

The dataset exhibits a notable gender imbalance, with 207 male individuals and 96 female individuals.

Chest Pain Type (cp):

The most prevalent chest pain type is Type 0 (typical angina, 143 individuals), followed by Type 2 (non-anginal pain, 87) and Type 1 (atypical angina, 50). Type 3 (asymptomatic) has the lowest frequency, occurring in only 23 individuals.

Fasting Blood Sugar (fbs):

A majority of individuals in the dataset (258) have fasting blood sugar at or below 120 mg/dl; only 45 exceed that threshold.

Resting Electrocardiographic Results (restecg):

The dataset sees a higher prevalence of Type 1 resting electrocardiographic results (152 occurrences), closely followed by Type 0 (147 occurrences). Type 2 is less common, appearing only 4 times.

Exercise-Induced Angina (exng):

Most individuals (204) do not have exercise-induced angina (value 0); it is present (value 1) in 99 individuals.

ST Slope (slp):

The most frequent ST slope is Type 2, observed 142 times, followed by Type 1 with 140 occurrences. Type 0 has the lowest frequency, appearing only 21 times.

Vessels Count (caa):

The absence of colored vessels by fluoroscopy (Type 0) is the most common, occurring in 175 individuals. Type 1 has 65 occurrences, Type 2 has 38, Type 3 has 20, and Type 4 has 5.

Thalassemia (thall):

Type 2 has the highest frequency (166), followed by Type 3 (117). Types 1 and 0 are rare, with 18 and 2 occurrences.

Heart Attack Output (target):

The target variable shows a slight imbalance, with 138 occurrences indicating less chance of a heart attack (0) and 165 occurrences indicating a higher chance (1).

In [73]:
data.columns
Out[73]:
Index(['Age', 'Gender', 'ChestPainType', 'RestingBloodPressure', 'Cholesterol',
       'FastingBloodSugar', 'RestEcg', 'MaxHeartRate', 'ExerciseInducedAngina',
       'OldPeak', 'StSlope', 'VesselsCount', 'Thalassemia',
       'HeartAttackOutput'],
      dtype='object')
In [74]:
numerical_columns
Out[74]:
['Age', 'RestingBloodPressure', 'Cholesterol', 'MaxHeartRate', 'OldPeak']
In [75]:
sns.pairplot(data=data,x_vars=numerical_columns,y_vars=numerical_columns,hue=target)
Out[75]:
<seaborn.axisgrid.PairGrid at 0x170ff744850>

Findings from Numerical Features :

Likelihood of Heart Attack in Younger Patients:

In this dataset, younger patients appear more often in the higher-chance class than older patients do. This suggests age as a potential factor influencing the likelihood of heart attacks, although the pattern describes this sample rather than risk in the general population.

Age-Related Decrease in Max Heart Rate:

Another trend identified is a decrease in maximum heart rate with age. As individuals age, there appears to be a decline in the maximum heart rate achieved. This inverse relationship between age and maximum heart rate could signify age-related changes in cardiovascular function.

While these initial trends have been identified, it is acknowledged that the observed patterns may appear relatively weak. Further analysis and exploration are warranted to delve deeper into the complexities of the dataset, considering potential confounding factors or interactions between variables that may influence the identified trends.

In [76]:
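def cat_vs_heartattack(data,features,target):
    """Draw one count plot per feature, split by the target class, on a 3x3 grid."""
    for idx, feature in enumerate(features):
        ax = plt.subplot(3,3,idx+1)
        sns.countplot(data=data,x=feature,hue=target,palette='bright',ax=ax)
        for container in ax.containers:
            ax.bar_label(container)
        # Hide the per-axes legend; a shared figure legend is added by the caller
        legend = ax.get_legend()
        if legend is not None:
            legend.remove()
    plt.tight_layout()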
In [77]:
figg = plt.figure(figsize=(12,12))
cat_vs_heartattack(data,categorical_features[:-1],target)

or_patch = mpl.patches.Patch(color='#DF7D20', label='1 - more chance of heart attack')
bl_patch = mpl.patches.Patch(color='#224FDF', label='0 - less chance of heart attack')

figg.legend(handles=[or_patch,bl_patch],bbox_to_anchor=(.9,.24),handleheight=4,handlelength=3)
plt.show()

Findings from Categorical Features:

Gender:

Females show a higher proportion of heart attacks (72 yes, 24 no), while males exhibit a more balanced distribution (93 yes, 114 no). This suggests potential gender-based differences in heart attack occurrences.

Chest Pain Type (cp):

Chest pain type 2 appears to have a higher likelihood of heart attacks (69 yes, 18 no). Type 3 and Type 1 also show heart attacks, but Type 0 has a comparatively lower frequency, indicating a potential association between chest pain type and heart attacks.

Fasting Blood Sugar (fbs):

Blood sugar levels do not seem to exhibit a clear effect on heart attacks, as the distribution is relatively balanced between yes and no categories.

Resting Electrocardiographic Results (restecg):

Resting electrocardiographic results of type 1 indicate a higher likelihood of heart attacks (96 yes, 56 no), while type 0 exhibits a more balanced distribution. Type 2 is too rare (4 cases) to support a conclusion on its own, but the contrast between types 0 and 1 suggests a potential association between ECG results and heart attacks.

Exercise-Induced Angina (exng):

Individuals without exercise-induced angina (value 0) show a higher likelihood of heart attacks (142 yes, 62 no). Those with exercise-induced angina (value 1) show a lower occurrence (23 yes, 76 no).

ST Slope (slp):

ST slope type 2 is associated with a higher likelihood of heart attacks (107 yes, 35 no), suggesting a potential correlation between the slope of the peak exercise ST segment and heart attack occurrences.

Vessels Count (caa):

Vessels count type 0 demonstrates a higher frequency of heart attacks (130 yes, 45 no). Other vessel count types have comparatively fewer occurrences of heart attacks.

Thalassemia (thall):

Thalassemia type 2 is associated with a higher likelihood of heart attacks (130 yes, 36 no), while other types generally show fewer occurrences of heart attacks.
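
Comparing categories of different sizes is easier with within-category rates than with raw counts. The sketch below rebuilds the gender rows from the counts quoted in the Gender finding above and normalizes the crosstab by row; `reconstructed` and `rates` are illustrative names.

```python
import pandas as pd

# Rebuild one row per patient from the counts in the Gender finding above:
# females 72 yes / 24 no, males 93 yes / 114 no (yes = output 1, no = output 0).
rows = (
    [("female", 1)] * 72
    + [("female", 0)] * 24
    + [("male", 1)] * 93
    + [("male", 0)] * 114
)
reconstructed = pd.DataFrame(rows, columns=["Gender", "HeartAttackOutput"])

# normalize="index" turns each row of counts into shares that sum to 1.
rates = pd.crosstab(
    reconstructed["Gender"], reconstructed["HeartAttackOutput"], normalize="index"
)
print(rates.round(3))
```

On the full frame the same call could be applied to any categorical feature in place of `Gender`.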

In [78]:
data[numerical_columns].describe()
Out[78]:
Age RestingBloodPressure Cholesterol MaxHeartRate OldPeak
count 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.366337 130.092409 242.993399 150.132013 0.971617
std 9.082101 15.130275 44.721507 22.167986 1.041070
min 29.000000 94.000000 126.000000 90.000000 0.000000
25% 47.500000 120.000000 211.000000 136.000000 0.000000
50% 55.000000 130.000000 240.000000 153.000000 0.800000
75% 61.000000 140.000000 272.000000 166.000000 1.600000
max 77.000000 170.000000 360.000000 202.000000 4.000000
In [79]:
data.describe()
Out[79]:
Age Gender ChestPainType RestingBloodPressure Cholesterol FastingBloodSugar RestEcg MaxHeartRate ExerciseInducedAngina OldPeak StSlope VesselsCount Thalassemia HeartAttackOutput
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.366337 0.683168 0.966997 130.092409 242.993399 0.148515 0.528053 150.132013 0.326733 0.971617 1.399340 0.729373 2.313531 0.544554
std 9.082101 0.466011 1.032052 15.130275 44.721507 0.356198 0.525860 22.167986 0.469794 1.041070 0.616226 1.022606 0.612277 0.498835
min 29.000000 0.000000 0.000000 94.000000 126.000000 0.000000 0.000000 90.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 47.500000 0.000000 0.000000 120.000000 211.000000 0.000000 0.000000 136.000000 0.000000 0.000000 1.000000 0.000000 2.000000 0.000000
50% 55.000000 1.000000 1.000000 130.000000 240.000000 0.000000 1.000000 153.000000 0.000000 0.800000 1.000000 0.000000 2.000000 1.000000
75% 61.000000 1.000000 2.000000 140.000000 272.000000 0.000000 1.000000 166.000000 1.000000 1.600000 2.000000 1.000000 3.000000 1.000000
max 77.000000 1.000000 3.000000 170.000000 360.000000 1.000000 2.000000 202.000000 1.000000 4.000000 2.000000 4.000000 3.000000 1.000000
In [107]:
by_groups = data.copy()

# Bin each numerical feature into labelled ranges.
# pd.cut uses right-closed intervals by default: (20, 30], (30, 40], ...
age_labels = np.array([['20-30', 0], ['30-40', 1], ['40-50', 2], ['50-60', 3], ['60-70', 4], ['70-80', 5]])
by_groups['Age Group'] = pd.cut(data['Age'], bins=[20, 30, 40, 50, 60, 70, 80], labels=age_labels[:, 0])

# Resting blood pressure bands; the last edge (300) is just a generous upper bound.
BP_labels = np.array([['Normal', 0], ['Pre-Hypertension', 1], ['Hypertension STG1', 2], ['Hypertension STG2', 3]])
by_groups['BP Level'] = pd.cut(data['RestingBloodPressure'], bins=[90, 120, 140, 160, 300], labels=BP_labels[:, 0])

chol_labels = np.array([['Low(okay)', 0], ['Borderline High', 1], ['High', 2]])
by_groups['Cholesterol Level'] = pd.cut(data['Cholesterol'], bins=[100, 200, 240, 1000], labels=chol_labels[:, 0])

maxHeartRate_labels = np.array([['Low', 0], ['Mid', 1], ['High', 2]])
by_groups['MaxHeartRate Level'] = pd.cut(data['MaxHeartRate'], bins=[0, 120, 140, float('inf')], labels=maxHeartRate_labels[:, 0])

# OldPeak has a minimum of 0.0, which the default (0, 1] interval would exclude
# (those rows would become NaN); include_lowest=True keeps them in 'Type 0'.
old_peek_labels = np.array([['Type 0'], ['Type 1'], ['Type 2'], ['Type 3'], ['Type 4']])
by_groups['Old Peak Type'] = pd.cut(data['OldPeak'], bins=[0, 1, 2, 3, 4, float('inf')], labels=old_peek_labels[:, 0], include_lowest=True)
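
One detail of `pd.cut` matters for the bins above: intervals are right-closed by default, so a value equal to the lowest edge matches no bin unless `include_lowest=True` is passed. The sketch below shows the difference on a few hand-picked `OldPeak`-like values (illustrative, not rows from the dataset).

```python
import pandas as pd

old_peak = pd.Series([0.0, 0.8, 1.0, 2.3, 4.0])
bins = [0, 1, 2, 3, 4, float("inf")]
labels = ["Type 0", "Type 1", "Type 2", "Type 3", "Type 4"]

# Default: intervals are (0, 1], (1, 2], ... so 0.0 falls outside every bin.
default_cut = pd.cut(old_peak, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 1].
inclusive_cut = pd.cut(old_peak, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "OldPeak": old_peak,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

Because the upper edge of each interval is inclusive, a value of exactly 4.0 lands in "Type 3", not "Type 4".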
In [108]:
new_numerical_features = ['Age Group','BP Level','Cholesterol Level','MaxHeartRate Level','Old Peak Type']
In [109]:
figg = plt.figure(figsize=(20,12))
cat_vs_heartattack(by_groups,new_numerical_features,target)

or_patch = mpl.patches.Patch(color='#DF7D20', label='1 - HeartAttack')
bl_patch = mpl.patches.Patch(color='#224FDF', label='0 - HeartAttack')

figg.legend(handles=[or_patch,bl_patch],bbox_to_anchor=(.9,.6),handleheight=4,handlelength=3)
plt.show()

Revised Findings after Grouping by Type and Level:

Age and Heart Attack Probability:

The grouped plot shows a distinct age-related pattern. Heart attack cases are notably more frequent in the younger age groups and decrease gradually after the 50-60 bracket. Within that pattern, the 40-50 bracket stands out with a surprisingly high number of cases. Age is therefore worth keeping as a factor when predicting heart attack risk.

Blood Pressure Levels and Heart Attacks:

Individuals with normal or pre-hypertension blood pressure levels account for most heart attack cases. This raises the possibility that people in these categories are unaware of their cardiovascular condition until they experience a heart attack. Cases among those with hypertension (Stage 1 and Stage 2) are fewer, which may reflect proactive health management. Note, however, that 75% of patients have a resting blood pressure of 140 or below, so the hypertension groups are small to begin with.

Cholesterol Levels and Heart Attacks:

There is an intriguing association between lower to borderline cholesterol levels and a higher frequency of heart attacks. This unexpected pattern challenges conventional expectations and warrants further investigation into the potential interplay of various factors influencing heart health.

Heart Rate and Heart Attacks:

Higher heart rates are correlated with an increased likelihood of heart attacks. This observation aligns with conventional expectations, indicating the importance of heart rate as a potential risk factor for cardiovascular events.

Old Peak Types (Type 0 and Type 1):

Types 0 and 1 contain the most individuals, so these Old Peak ranges are the most prevalent in the dataset. Understanding the characteristics and implications of these ranges could provide valuable insight into the overall cardiovascular health of the individuals.

These refined findings shed light on the nuanced relationships between various health indicators and the probability of heart attacks. Further analysis and exploration are recommended to uncover the underlying mechanisms and contributing factors.
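
When comparing groups, it helps to separate the number of cases from the rate within each group, since a large group can show many cases even at a modest rate. The sketch below computes both on a small made-up frame (the ages and outcomes are illustrative only); the column names follow this notebook.

```python
import pandas as pd

# Toy stand-in for `by_groups`; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [35, 38, 45, 47, 52, 55, 58, 63, 66, 72],
    "HeartAttackOutput": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})
toy["Age Group"] = pd.cut(
    toy["Age"],
    bins=[20, 30, 40, 50, 60, 70, 80],
    labels=["20-30", "30-40", "40-50", "50-60", "60-70", "70-80"],
)

# cases = number of positive records, patients = group size, rate = cases / patients.
# observed=True drops age groups that contain no rows.
summary = (
    toy.groupby("Age Group", observed=True)["HeartAttackOutput"]
       .agg(cases="sum", patients="count", rate="mean")
)
print(summary)
```

The same call works for 'BP Level', 'Cholesterol Level', 'MaxHeartRate Level', and 'Old Peak Type' by changing the grouping column.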

In [110]:
by_groups.head()
Out[110]:
Age Gender ChestPainType RestingBloodPressure Cholesterol FastingBloodSugar RestEcg MaxHeartRate ExerciseInducedAngina OldPeak StSlope VesselsCount Thalassemia HeartAttackOutput Age Group BP Level Cholesterol Level MaxHeartRate Level Old Peak Type
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1 60-70 Hypertension STG1 Borderline High High Type 2
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1 30-40 Pre-Hypertension High High Type 3
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1 40-50 Pre-Hypertension Borderline High High Type 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1 50-60 Normal Borderline High High Type 0
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1 50-60 Normal High High Type 0
In [116]:
dummies = pd.get_dummies(by_groups, columns=new_numerical_features, dtype=int)

# If you want to keep the original columns as well, you can concatenate them
by_groups = pd.concat([by_groups, dummies], axis=1)
print(by_groups.columns)
by_groups = by_groups.loc[:,~by_groups.columns.duplicated()]
Index(['Age', 'Gender', 'ChestPainType', 'RestingBloodPressure', 'Cholesterol',
       'FastingBloodSugar', 'RestEcg', 'MaxHeartRate', 'ExerciseInducedAngina',
       'OldPeak',
       ...
       'Cholesterol Level_Borderline High', 'Cholesterol Level_High',
       'MaxHeartRate Level_Low', 'MaxHeartRate Level_Mid',
       'MaxHeartRate Level_High', 'Old Peak Type_Type 0',
       'Old Peak Type_Type 1', 'Old Peak Type_Type 2', 'Old Peak Type_Type 3',
       'Old Peak Type_Type 4'],
      dtype='object', length=124)
In [117]:
new_data = by_groups[['Gender', 'ChestPainType', 'FastingBloodSugar', 'RestEcg', 'ExerciseInducedAngina',
                      'StSlope', 'VesselsCount', 'Thalassemia',
                      'Age Group_30-40', 'Age Group_40-50', 'Age Group_50-60', 'Age Group_60-70', 'Age Group_70-80',
                      'BP Level_Pre-Hypertension', 'BP Level_Hypertension STG1', 'BP Level_Hypertension STG2',
                      'Cholesterol Level_Borderline High', 'Cholesterol Level_High',
                      'MaxHeartRate Level_Mid', 'MaxHeartRate Level_High',
                      'Old Peak Type_Type 1', 'Old Peak Type_Type 2', 'Old Peak Type_Type 3', 'Old Peak Type_Type 4',
                      'HeartAttackOutput']]
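
The column list above keeps every dummy except the first level of each grouped feature (for example, there is no 'Age Group_20-30'), which makes that level the baseline. As a possible alternative to listing the columns by hand, `pd.get_dummies` can drop the first level itself via `drop_first=True`. This is a sketch on a toy frame, not the notebook's data.

```python
import pandas as pd

# Toy stand-in with one grouped feature; the values are illustrative only.
toy = pd.DataFrame({"Cholesterol": [180, 220, 260, 199, 300]})
toy["Cholesterol Level"] = pd.cut(
    toy["Cholesterol"],
    bins=[100, 200, 240, 1000],
    labels=["Low(okay)", "Borderline High", "High"],
)

# drop_first=True omits the first category ('Low(okay)'), which becomes the baseline.
encoded = pd.get_dummies(toy, columns=["Cholesterol Level"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

A row whose remaining dummy columns are all 0 belongs to the dropped baseline level.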
In [118]:
new_data.head()
Out[118]:
Gender ChestPainType FastingBloodSugar RestEcg ExerciseInducedAngina StSlope VesselsCount Thalassemia Age Group_30-40 Age Group_40-50 ... BP Level_Hypertension STG2 Cholesterol Level_Borderline High Cholesterol Level_High MaxHeartRate Level_Mid MaxHeartRate Level_High Old Peak Type_Type 1 Old Peak Type_Type 2 Old Peak Type_Type 3 Old Peak Type_Type 4 HeartAttackOutput
0 1 3 1 0 0 0 0 1 0 0 ... 0 1 0 0 1 0 1 0 0 1
1 1 2 0 1 0 0 0 2 1 0 ... 0 0 1 0 1 0 0 1 0 1
2 0 1 0 0 0 2 0 2 0 1 ... 0 1 0 0 1 1 0 0 0 1
3 1 1 0 1 0 2 0 2 0 0 ... 0 1 0 0 1 0 0 0 0 1
4 0 0 0 1 1 2 0 2 0 0 ... 0 0 1 0 1 0 0 0 0 1

5 rows × 25 columns

In [119]:
plt.figure(figsize=(19, 19))
sns.heatmap(new_data.corr(), annot=True, vmin=-1, vmax=1, linewidths=0.1, cmap='coolwarm', xticklabels=True, yticklabels=True, fmt='.2f')
plt.show()

Features: The features include characteristics of the patient, such as age, gender, and blood pressure, as well as results of medical tests, such as ECG and cholesterol level. There are also features related to the type of chest pain experienced by the patient.

Target variable: The target variable is "HeartAttackOutput", which indicates whether or not the patient had a heart attack.

Color scale: With the coolwarm colormap used above, the scale ranges from blue (negative correlation) to red (positive correlation), and correlations near 0 appear in a pale, neutral shade. Some specific correlations that can be seen in the heatmap:

  • Age group 60-70 and heart attack output have a positive correlation. This means that patients in this age group are more likely to have a heart attack.

  • Chest pain type and heart attack output have a positive correlation. This means that certain types of chest pain are more indicative of a heart attack.

  • Max heart rate level (both mid and high) and heart attack output have a positive correlation. This means that a higher heart rate is associated with an increased risk of heart attack.

Across the heatmap, the features most strongly correlated with the likelihood of a heart attack are Exercise-Induced Angina, Chest Pain Type (particularly Type 2 and Type 3, as highlighted in previous findings), Vessels Count, ST Slope, Gender, Thalassemia, Max Heart Rate Level (High), and Old Peak Type 2 and Type 3. Together these span exercise-induced symptoms, chest pain attributes, vascular involvement, electrocardiographic patterns, and demographic variables.

Other features, such as 'BP Level_Pre-Hypertension', 'BP Level_Hypertension STG1', 'BP Level_Hypertension STG2', 'Age Group_30-40', 'Old Peak Type_Type 1', and 'FastingBloodSugar', show a near-zero correlation with the target variable. Eliminating them according to their significance is one way to reduce the number of features and mitigate the curse of dimensionality.
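
One way to turn the heatmap reading into a concrete shortlist is to rank features by the absolute value of their correlation with the target and flag those below a cut-off. The sketch below uses a small deterministic toy frame. The feature names `strong_feature` and `weak_feature` and the threshold of 0.1 are hypothetical choices for illustration, not values from this analysis.

```python
import numpy as np
import pandas as pd

# Deterministic toy data: the target alternates 0, 1, 0, 1, ...
target = np.tile([0, 1], 50)
toy = pd.DataFrame({
    # Tracks the target with a small repeating offset.
    "strong_feature": target + np.tile([0.2, 0.2, -0.2, -0.2], 25),
    # Pattern 0, 0, 1, 1 is uncorrelated with the alternating target.
    "weak_feature": np.tile([0, 0, 1, 1], 25),
    "HeartAttackOutput": target,
})

# Correlation of every feature with the target, ordered by absolute strength.
corr_with_target = (
    toy.corr()["HeartAttackOutput"]
       .drop("HeartAttackOutput")
       .sort_values(key=abs, ascending=False)
)

threshold = 0.1  # hypothetical cut-off, to be tuned for the real data
low_signal = corr_with_target[corr_with_target.abs() < threshold].index.tolist()

print(corr_with_target)
print("Candidates to drop:", low_signal)
```

Pearson correlation only captures linear association, and for integer-coded categories such as ChestPainType the coefficient depends on how the codes are ordered, so a shortlist like this is a starting point and not a final selection.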

5. Conclusions

Numerical Variables Analysis:

Age: Ages are concentrated in the mid-50s (mean 54.4, median 55), with a slight negative skew, meaning a longer tail toward younger patients. The risk of a heart attack tends to increase with age.

Resting Blood Pressure: The distribution leans towards higher values, indicating some individuals with significantly higher blood pressure.

Cholesterol: The majority have higher cholesterol values, with a skewed distribution towards higher levels.

Max Heart Rate: Slight tendency towards lower heart rates, but overall a typical distribution.

Old Peak: The distribution appears somewhat uniform, with a prevalent absence of old peaks during stress testing.
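
The skewness statements above can be checked numerically. In the `describe()` output, Age has a mean (54.37) just below its median (55), consistent with a slight negative skew, while OldPeak has a mean (0.97) above its median (0.80), consistent with a right tail. The sketch below shows the same check on two made-up columns (illustrative values, hypothetical column names); on the notebook's data the equivalent call would be `data[numerical_columns].skew()`.

```python
import pandas as pd

# Toy columns with deliberately opposite tails; the values are illustrative only.
toy = pd.DataFrame({
    "left_tail":  [10, 50, 55, 58, 60, 61, 62],           # one unusually low value
    "right_tail": [0.0, 0.0, 0.1, 0.2, 0.5, 1.0, 4.0],    # one unusually high value
})

skew = toy.skew()                  # sample skewness per column
gap = toy.mean() - toy.median()    # sign usually matches the skew direction

print(pd.DataFrame({"skew": skew, "mean_minus_median": gap}))
```

A negative skew means the tail extends toward low values, and a positive skew means it extends toward high values.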

Categorical Variables Analysis:

Gender: More males in the dataset, but females show a higher proportion of heart attacks.

Chest Pain Type (cp): Type 2 chest pain has a higher likelihood of heart attacks.

Fasting Blood Sugar (fbs): No clear effect on heart attacks; distribution is relatively balanced.

Resting Electrocardiographic Results (restecg): Type 1 is associated with a higher likelihood of heart attacks.

Exercise-Induced Angina (exng): Type 0 (no exercise-induced angina) is more prevalent in individuals with heart attacks.

ST Slope (slp): ST slope type 2 is associated with a higher likelihood of heart attacks.

Vessels Count (caa): Vessels count type 0 is more frequent in individuals with heart attacks.

Thalassemia (thall): Type 2 thalassemia is associated with a higher likelihood of heart attacks.

Heart Attack Output (target): Slight imbalance: about 54% of the 303 records (mean 0.545) are labeled 1, indicating a higher chance of a heart attack.

6. Future Directions

Health-Centric Approach

Wellness Monitoring: Implement continuous health monitoring systems for individuals, integrating wearable devices and mobile applications. Track vital signs, physical activity, and lifestyle choices to create personalized health profiles.

Predictive Health Models: Develop advanced machine learning models that predict potential health risks, including heart attacks, based on continuous health data. Integrate real-time monitoring to provide timely alerts and preventive recommendations.

Behavioral Interventions: Design behavioral interventions using technology to encourage healthier lifestyles. Provide personalized recommendations for physical activity, nutrition, and stress management.

Telemedicine Integration: Expand telemedicine services to offer remote consultations and health check-ups. Utilize digital platforms for regular health assessments and consultations.

Community Health Initiatives: Launch community-based health programs to raise awareness about cardiovascular health. Conduct regular health screenings and educational workshops to empower communities.

Genetic Health Profiling: Explore genetic testing to identify predispositions to cardiovascular conditions. Use genetic data to personalize health recommendations and interventions.

Remote Health Monitoring: Implement remote monitoring for chronic health conditions, allowing healthcare providers to track patients' well-being. Utilize technology to transmit health data securely and enable timely medical interventions.

Integrated Health Platforms: Develop integrated health platforms that connect individuals, healthcare providers, and health data analytics. Facilitate seamless sharing of health information for comprehensive care.

Preventive Healthcare Policies: Advocate for policies that prioritize preventive healthcare measures. Encourage insurance coverage for preventive screenings, wellness programs, and digital health technologies.

Research Collaborations: Foster collaborations between healthcare providers, tech companies, and research institutions. Conduct interdisciplinary research to advance health technologies and interventions.

By embracing these health-centric future directions, we can shift towards a proactive and personalized healthcare paradigm, focusing on prevention, early detection, and holistic well-being. This approach has the potential to improve overall health outcomes and reduce the burden of cardiovascular diseases on individuals and healthcare systems.

THANK YOU!
